Lesson 6. Spatial Queries

Spatial analysis is a process that begins with exploring and mapping a dataset and can lead to potentially complex models and visualizations of real world features and phenomena. Spatial queries are the building blocks of this analytical process. These queries are software operations that allow us to ask questions of our data and which return data metrics, subsets or new data objects. In this lesson we explore the two basic types of spatial queries: measurement queries and relationship queries.


Instructor Notes


Types of Spatial Queries

The basic types of spatial queries are:

  • Measurement queries
    • What is feature A’s length?
    • What is feature A’s perimeter?
    • What is feature A’s area?
    • What is feature A’s distance from feature B?
    • etc.
  • Relationship queries
    • Does feature A intersect with feature B?
    • Is feature A within feature B?
    • Does feature A cross feature B?
    • etc.

Both of these types of queries operate on the geometry of features in one or two datasets and are dependent on the type of geometry. For example, with point features you can make distance measurements or ask what points are spatially inside polygon objects. Polygon features, on the other hand, allow for a wider range of both measurement and spatial relationship queries.

An important distinction between these two types of queries is that measurement queries depend on the CRS of the data while spatial relationship queries do not. This is because topological relationships, the term used to describe spatial relationships, are invariant to rotation, translation and scaling transformations like those that CRS transformations entail.

Attribute vs. Spatial Queries

We already know how to do attribute queries with our data. For example, we can select one or more specific counties by name or select those counties where the total population is greater than 100,000 because we have these columns in the dataset.

Spatial queries are special because they are dynamic. For example, we can compute area from the geometry without it already being encoded or we can select BART stations in Berkeley even if city is not encoded in the BART data by linking those two spatial datasets in the same geographic space. This dynamic query capability is extremely powerful!

In this lesson we’ll work through examples of each of those types of queries.

Then we’ll try a very common spatial analysis method that is a conceptual amalgam of those two types: proximity analysis.

6.0 Load and prep the data

Load the libraries we will use.

library(sf)
library(tmap)

Read in the CA census tracts data and then take a look at its geometry and attributes.

census_tracts = st_read("notebook_data/census/Tracts/cb_2018_06_tract_500k.shp")
## Reading layer `cb_2018_06_tract_500k' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/census/Tracts/cb_2018_06_tract_500k.shp' using driver `ESRI Shapefile'
## Simple feature collection with 8041 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.4096 ymin: 32.53416 xmax: -114.1312 ymax: 42.00948
## Geodetic CRS:  NAD83
plot(census_tracts$geometry)

head(census_tracts)
## Simple feature collection with 6 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.5009 ymin: 37.92535 xmax: -120.3345 ymax: 39.30875
## Geodetic CRS:  NAD83
##   STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID    NAME LSAD
## 1      06      009  000300 1400000US06009000300 06009000300       3   CT
## 2      06      011  000300 1400000US06011000300 06011000300       3   CT
## 3      06      013  303102 1400000US06013303102 06013303102 3031.02   CT
## 4      06      013  303202 1400000US06013303202 06013303202 3032.02   CT
## 5      06      013  303203 1400000US06013303203 06013303203 3032.03   CT
## 6      06      013  307102 1400000US06013307102 06013307102 3071.02   CT
##       ALAND AWATER                       geometry
## 1 457009794 394122 MULTIPOLYGON (((-120.764 38...
## 2 952744514 195376 MULTIPOLYGON (((-122.5001 3...
## 3   6507019      0 MULTIPOLYGON (((-121.7294 3...
## 4   3725528      0 MULTIPOLYGON (((-121.7235 3...
## 5   6354210      0 MULTIPOLYGON (((-121.7449 3...
## 6   1603153      0 MULTIPOLYGON (((-121.8226 3...

Select just the Alameda County census tracts.

census_tracts_ac = census_tracts[census_tracts$COUNTYFP=='001',]
plot(census_tracts_ac)

6.1 Measurement Queries

We’ll start off with some simple measurement queries.

We can get the area of each of our census tracts using thesf function st_area.

st_area(census_tracts_ac)[1:10]
## Units: [m^2]
##  [1] 1162918.0 1000842.5 1664804.5 3178120.0  814375.4 1251118.1 1034057.0
##  [8]  858325.5 1997754.4 1678617.1

Okay!

We got…

numbers!

…?

Question

  1. What do those numbers mean?
  2. What are the units?
  3. And if we’re not sure, how might be find out?

Let’s take a look at our CRS.

st_crs(census_tracts_ac)
## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]

Wow! We’re working with data that are in what is called an unprojected CRS. That means that the coordinates are latitude and longitude values and the units are decimal degrees. However, the sf::st_area function automatically returned area measurements in square meters (rather than in square degrees, which don’t make sense.)

How did it do this? Well, you can check out the help documentation for ?st_area for more information. If the data have a projected CRS, st_area uses Euclidean geometry to return area measurements in the units of the CRS. For an unprojected CRS, st_area calculates geodetic area on a curved surface model of the Earth and returns measurements in square meters. Pretty cool and pretty useful right?


That said, when doing spatial analysis, we will almost always want to convert all of our data to the same projected CRS since many spatial operations do not work with geographic CRSs.

Time to project! We’ll transform the data to the UTM Zone 10N, NAD83 CRS (EPSG 26910) which is appropriate for Northern California location data and is highly accurate for measurement queries for areas within the zone.

#Transform CRS of census tract data
census_tracts_ac_utm10 = st_transform(census_tracts_ac, 26910)

Now check it..especially look for the units.

st_crs(census_tracts_ac_utm10)
## Coordinate Reference System:
##   User input: EPSG:26910 
##   wkt:
## PROJCRS["NAD83 / UTM zone 10N",
##     BASEGEOGCRS["NAD83",
##         DATUM["North American Datum 1983",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4269]],
##     CONVERSION["UTM zone 10N",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",0,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",-123,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",0.9996,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",500000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["(E)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["(N)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["North America - 126°W to 120°W and NAD83 by country"],
##         BBOX[30.54,-126,81.8,-119.99]],
##     ID["EPSG",26910]]

Now let’s try our area calculation again.

st_area(census_tracts_ac_utm10)[1:10]
## Units: [m^2]
##  [1] 1162095.2 1000143.8 1663700.2 3176020.4  813846.5 1250301.3 1033393.0
##  [8]  857775.7 1996529.8 1677565.1

What if we compare areas calculated with our unprojected and projected CRS data?

# Using format to make for easier to read display
print(format(st_area(census_tracts_ac)[[1]], big.mark=','))
## [1] "1,162,918 [m^2]"
print(format(st_area(census_tracts_ac_utm10)[[1]], big.mark=","))
## [1] "1,162,095 [m^2]"

Hmmm… The numbers are a bit different…specifically…

format((st_area(census_tracts_ac)[[1]] - st_area(census_tracts_ac_utm10)[[1]]),digits=0, big.mark=',')
## [1] "823 [m^2]"

You may have noticed that our census tracts already have an area column in them.

Let’s also compare the calculated areas with the data in this column.

print(st_area(census_tracts_ac)[[1]])
## 1162918 [m^2]
print(st_area(census_tracts_ac_utm10)[[1]])
## 1162095 [m^2]
print(census_tracts$ALAND[1])
## [1] 457009794

Question

What explains the discrepancy? Which areas are correct? Which are incorrect?

Doing more

We can also calculate the area for Alameda county summing the areas of all census tracts.

sum(st_area(census_tracts_ac_utm10))
## 1943203171 [m^2]

We can look up how large Alameda County is to check our work.The county is 739 miles2, which is around 1,914,001,213 meters2. I’d say we’re pretty close!

# Sum the area of all Alameda county Census Tracts dynamically in square miles...
sum(units::set_units(st_area(census_tracts_ac_utm10),mi^2))
## 750.2749 [mi^2]

Calculating the area of all features and adding the output to the dataframe is a useful operation because it allows us to convert count variables like population to densities.

# Add the area of all Alameda County Census tracts to the data frame
census_tracts_ac_utm10$area_sqmi <-units::set_units(st_area(census_tracts_ac_utm10), mi^2)

# Check it by summing
print(sum(census_tracts_ac_utm10$area_sqmi))
## 750.2749 [mi^2]
# Take a look
head(census_tracts_ac_utm10,3)
## Simple feature collection with 3 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 560194.2 ymin: 4174539 xmax: 575344.2 ymax: 4189173
## Projected CRS: NAD83 / UTM zone 10N
##    STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID    NAME LSAD
## 27      06      001  425101 1400000US06001425101 06001425101 4251.01   CT
## 28      06      001  428600 1400000US06001428600 06001428600    4286   CT
## 29      06      001  432600 1400000US06001432600 06001432600    4326   CT
##      ALAND  AWATER                       geometry        area_sqmi
## 27  590870 2045459 MULTIPOLYGON (((560340.9 41... 0.4486875 [mi^2]
## 28  898967 1080420 MULTIPOLYGON (((563419 4180... 0.3861577 [mi^2]
## 29 1673450       0 MULTIPOLYGON (((573362.1 41... 0.6423583 [mi^2]

You may be wondering how R is managing the units of our measurements.

It turns out that sf depends on the units package to track units.

This is super convenient! But there is a gotcha:

# convert to square kilometers
sum(st_area(census_tracts_ac_utm10)) / (1000^2)
## 1943.203 [m^2]

Oops! Our manual conversion to square kilometers gave us the right number but kept the now-wrong units!

Here’s the proper way to convert:

units::set_units(sum(st_area(census_tracts_ac_utm10)), km^2)
## 1943.203 [km^2]

Much nicer! In case you’re wondering how we knew the right abbreviation to use for kilometers, check out the leftmost column in this reference table:

# View(units::valid_udunits())

Calculating Length or Permeter

We can use the st_length operator in the same way to calculate the length or perimeter of features. Always take note of the output units!

st_length(census_tracts)[1:10]
## Units: [m]
##  [1] 125204.900 159644.736  11948.424   9673.456  12629.443   6430.248
##  [7]  10607.608   7853.356   6170.446   5183.838

Calculating Distance

The st_distance function can be used to find the pairwise distance between two sets of geometries.

st_distance(census_tracts_ac_utm10, census_tracts_ac_utm10)[1:5,1:5]
## Units: [m]
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,]     0.000  6705.837 15753.864 17394.946 20876.861
## [2,]  6705.837     0.000  8984.300 10162.990 14244.463
## [3,] 15753.864  8984.300     0.000     0.000  3031.971
## [4,] 17394.946 10162.990     0.000     0.000  1135.945
## [5,] 20876.861 14244.463  3031.971  1135.945     0.000

You can also use it to find the distance between specific features.

# Identify my tracts of interest
mytracts = c('4201','4202','4203','4204')
# What is the distance between tract 4201 and all other tracts
st_distance(census_tracts_ac_utm10[census_tracts_ac_utm10$NAME=='4101',],
            census_tracts_ac_utm10[census_tracts_ac_utm10$NAME %in% mytracts,] )
## Units: [m]
##          [,1]     [,2]     [,3]     [,4]
## [1,] 19263.81 19220.14 18822.72 18946.69

6.2 Spatial Relationship Queries

Spatial relationship queries consider how two geometries or sets of geometries relate to one another in space. For example, you may want to know what schools are located within the City of Berkeley or what East Bay Regional Parks have land within Berkeley. You may also want to combine a measurement query with a spatial relationship query. Example, you may want to know the total length of freeways within the city of Berkeley.

Here is a list of some of the more commonly used sf spatial relationship operations.

  • st_intersects
  • st_within
  • st_contains
  • st_disjoint

These can be used to select features in one dataset based on their spatial relationship to another. In other works, you can use these operations to make spatial selections / create spatial subsets.

Enough talk. Let’s work through some examples.

What Alameda County Schools are in Berkeley?

First, load the CA Places data and select the city of Berkeley and save it to a sf dataframe.

places = st_read('notebook_data/census/Places/cb_2018_06_place_500k.shp')
## Reading layer `cb_2018_06_place_500k' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/census/Places/cb_2018_06_place_500k.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1521 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.2694 ymin: 32.53416 xmax: -114.229 ymax: 41.99314
## Geodetic CRS:  NAD83
berkeley = places[places$NAME=='Berkeley',]
plot(berkeley$geometry)

Then, load the Alameda County schools data and make it a spatial dataframe.

schools_df = read.csv('notebook_data/alco_schools.csv')
schools_sf = st_as_sf(schools_df, 
                      coords = c('X','Y'),
                      crs = 4326)

Check that the two sf dataframes have the same CRS.

st_crs(schools_sf) == st_crs(berkeley)
## [1] FALSE

They don’t have the same CRS so we need to align them. Let’s transform (or reproject) the CRS of both of these dataframes to UTM10N, NAD83 (EPSG 26910). This is a commonly used CRS for Northern CA data.

# Transform data CRSs...
schools_utm10 <- st_transform(schools_sf, 26910)
berkeley_utm10 <- st_transform(berkeley, 26910)

If you look at the Schools data you will see that it has a City column. So we can subset the data by attribute to select the Schools in Berkeley. No need to do a spatial selection.

berkeley_schools = schools_utm10[schools_utm10$City=='Berkeley',]
dim(berkeley_schools)
## [1] 31  8

Confirm the results by plotting the data

plot(berkeley_utm10$geometry)
plot(berkeley_schools$geometry, add=T)

That looks good and was a relatively simple operation. But what if the schools data didn’t have that city column or if only some of the rows had a value in that column. How can we identify the schools in Berkeley spatially?

Here’s how!

# SPATIALLY select only the schools within Berkeley
berkeley_schools_spatial = schools_utm10[berkeley_utm10, , op=st_intersects]  #NO QUOTES!!!

Yes that was it! Take a long look at that simple yet powerful spatial selection syntax.

You should interpret that syntax as:

  • "Select all features (i.e. rows) in the schools_utm10 dataframe:

    • schools_utm10[berkeley_utm10, , op=st_intersects]
  • and all of the columns:

    • schools_utm10[berkeley_utm10 , , op=st_intersects]

    (all because the extraction brackets have no second argument)

  • whose geometry spatially intersects the Berkeley_utm10 geometry

    • schools_utm10[berkeley_utm10, , op=st_intersects]
Important

The op=st_intersects argument is optional because st_intersects is the default spatial selector.

To emphasize this, let’s rerun the last command

# SPATIALLY select only the schools within Berkeley
berkeley_schools_spatial = schools_utm10[berkeley_utm10, ]

What does spatiallly intersects mean?

Here’s one way to explain it.

Geometry A spatially intersects Geometry B if any of its parts (e.g., a point, line segment, or polygon) is equivalent to, touches, crosses, is contained by, contains, or overlaps any part of Geometry B.

So st_intersects is the mother of all spatial relationships! It is the most general and the most useful. However, you can specify any of those more specific spatial relationships by setting op= to any of the options listed in the ?st_intersects? help documentation.

Let’s check out the sf object that our selection returned.

# How many schools did we get
dim(berkeley_schools_spatial)
## [1] 32  8
# Map the results
plot(berkeley_utm10$geometry)
plot(berkeley_schools_spatial$geometry, add=T)

Interestingly, we have one more school in Berkeley based on the spatial selection!? Let’s take a look.

plot(berkeley_utm10$geometry)
plot(berkeley_schools_spatial$geometry, add=T)
plot(berkeley_schools$geometry,col="red", add=T)

Let’s use an interactive tmap to zoom into the school that was not selected by attribute but was spatially selected.

tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(berkeley_utm10) +
  tm_borders() +
tm_shape(berkeley_schools_spatial) +
 tm_dots(col="black", size=.3) +
 tm_shape(berkeley_schools) +
 tm_dots(col="red", size=.1)

IMPORTANT: The default spatial selection operator is st_intersects. If you want to use any other operator, it must be specified.

For example, we can use the st_disjoint operator to select only those schools NOT in Berkeley.

# Select all Alameda County Schools NOT in Berkeley with the disjoint operator
berkeley_schools_disjoint = schools_utm10[berkeley_utm10, , op=st_disjoint]

# Plot the result
plot(berkeley_schools_disjoint$geometry)
plot(berkeley_utm10, col=NA, border="red", add=T)
## Warning in plot.sf(berkeley_utm10, col = NA, border = "red", add = T): ignoring
## all but the first attribute

There is no need to memorize these spatial operators (aka predicates)! Here is a fantastic sf cheatsheet that lists and briefly explains all these common functions (and many more).


Protected Areas in Alameda County

Let’s load a new dataset, the CA Protected Areas Database (CPAD), to demonstrate these spatial queries in more detail.

This dataset contains all of the protected areas (parks and the like) in California.

We will use this data and the Alameda County Census Tract Data that we created earlier to ask “What protected areas are within Alameda County?”

First load the CPAD data.

cpad = st_read('./notebook_data/protected_areas/CPAD_2020a_Units.shp')
## Reading layer `CPAD_2020a_Units' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/protected_areas/CPAD_2020a_Units.shp' using driver `ESRI Shapefile'
## Simple feature collection with 17068 features and 21 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -374984.2 ymin: -604454.8 xmax: 540016.3 ymax: 449743.2
## Projected CRS: NAD83 / California Albers

What is the CRS of the CPAD data?

Let’s transform the data to match census_tracts_ac_utm10.

cpad_utm10 = st_transform(cpad, st_crs(census_tracts_ac_utm10))

Let’s plot the data in so that we know what to expect. CPAD is big so wait for it…

plot(census_tracts_ac_utm10$geometry, col='grey', border="grey")
plot(cpad_utm10$geometry, col='green', add=T)

We can see from our map that some of the protected areas are completely within Alameda County, some of them overlap, and some are completely outside of the county. To get both of the “inside” and “overlaps” cases we use the st_intersects spatial selection operator, which is the default. Let’s check it out.

cpad_intersects = cpad_utm10[census_tracts_ac_utm10,]  #st_intersects
cpad_within = cpad_utm10[census_tracts_ac_utm10, , op=st_within] #st_within

We can use tmap to explore the difference in the results from st_intersects vs st_within

tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(census_tracts_ac_utm10)+
  tm_polygons(col="gray", border.col="grey") +
tm_shape(cpad_intersects) +
  tm_borders(col="green") +
tm_shape(cpad_within) +
  tm_borders(col='red')

What you can see from the above, is that by default, st_intersects returns the features that intersect but it does not clip the features to the boundary of Alameda County. For that, we would need to use a different spatial operation - st_intersection.

Let’s try it!

cpad_in_ac = st_intersection(cpad_utm10, census_tracts_ac_utm10)
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries

Great! Now, if we scroll the resulting sf object we’ll see that the COUNTY column of our resulting subset gives us a good sanity check on our results. Or does it?

table(cpad_in_ac$COUNTY)
## 
##      Alameda Contra Costa  San Joaquin  Santa Clara 
##          836           20            2            4

Always check your output - both the attribute table & the geometry!

head(cpad_in_ac)
## Simple feature collection with 6 features and 31 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: 560241 ymin: 4187290 xmax: 562089.7 ymax: 4189117
## Projected CRS: NAD83 / UTM zone 10N
##              ACCESS_TYP UNIT_ID                                UNIT_NAME
## 9474        Open Access   15869          McLaughlin Eastshore State Park
## 10131       Open Access   17626                              Marina Park
## 11568       Open Access   23663                           Davenport Park
## 12466 Restricted Access   28130          McLaughlin Eastshore State Park
## 12997 Restricted Access   46940 Emeryville Crescent State Marine Reserve
## 15136       Open Access   46002                         Point Emery Park
##       SUID_NMA AGNCY_ID                                    AGNCY_NAME AGNCY_LEV
## 9474     29199      204 California Department of Parks and Recreation     State
## 10131    21833     1098                           Emeryville, City of      City
## 11568    17939     1098                           Emeryville, City of      City
## 12466    29198      204 California Department of Parks and Recreation     State
## 12997    30564      204 California Department of Parks and Recreation     State
## 15136    29208     1098                           Emeryville, City of      City
##          AGNCY_TYP                                     AGNCY_WEB
## 9474  State Agency                      http://www.parks.ca.gov/
## 10131  City Agency http://www.ci.emeryville.ca.us/158/City-Parks
## 11568  City Agency http://www.ci.emeryville.ca.us/158/City-Parks
## 12466 State Agency                      http://www.parks.ca.gov/
## 12997 State Agency                      http://www.parks.ca.gov/
## 15136  City Agency http://www.ci.emeryville.ca.us/158/City-Parks
##                                               LAYER MNG_AG_ID
## 9474  California Department of Parks and Recreation      2032
## 10131                                          City      1098
## 11568                                          City      1098
## 12466 California Department of Parks and Recreation      2032
## 12997 California Department of Parks and Recreation      2032
## 15136                                          City      2032
##                            MNG_AGENCY       MNG_AG_LEV
## 9474  East Bay Regional Park District Special District
## 10131             Emeryville, City of             City
## 11568             Emeryville, City of             City
## 12466 East Bay Regional Park District Special District
## 12997 East Bay Regional Park District Special District
## 15136 East Bay Regional Park District Special District
##                      MNG_AG_TYP                               PARK_URL  COUNTY
## 9474  Recreation/Parks District http://www.ebparks.org/parks/eastshore Alameda
## 10131               City Agency                                   <NA> Alameda
## 11568               City Agency                                   <NA> Alameda
## 12466 Recreation/Parks District http://www.ebparks.org/parks/eastshore Alameda
## 12997 Recreation/Parks District                                   <NA> Alameda
## 15136 Recreation/Parks District                  http://www.ebarks.org Alameda
##          ACRES                               LABEL_NAME YR_EST     DES_TP
## 9474  1084.173                  McLaughlin Eastshore SP      0       <NA>
## 10131   10.322                              Marina Park      0 Local Park
## 11568    2.102                           Davenport Park      0 Local Park
## 12466  456.308                  McLaughlin Eastshore SP     NA       <NA>
## 12997   58.981 Emeryville Crescent State Marine Reserve   1985       <NA>
## 15136   20.687                         Point Emery Park      0 Local Park
##       GAP_STS STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID    NAME
## 9474     <NA>      06      001  425101 1400000US06001425101 06001425101 4251.01
## 10131       4      06      001  425101 1400000US06001425101 06001425101 4251.01
## 11568       4      06      001  425101 1400000US06001425101 06001425101 4251.01
## 12466    <NA>      06      001  425101 1400000US06001425101 06001425101 4251.01
## 12997    <NA>      06      001  425101 1400000US06001425101 06001425101 4251.01
## 15136       4      06      001  425101 1400000US06001425101 06001425101 4251.01
##       LSAD  ALAND  AWATER        area_sqmi                       geometry
## 9474    CT 590870 2045459 0.4486875 [mi^2] MULTIPOLYGON (((561544 4188...
## 10131   CT 590870 2045459 0.4486875 [mi^2] MULTIPOLYGON (((560255.1 41...
## 11568   CT 590870 2045459 0.4486875 [mi^2] POLYGON ((560870.4 4188045,...
## 12466   CT 590870 2045459 0.4486875 [mi^2] MULTIPOLYGON (((560376.8 41...
## 12997   CT 590870 2045459 0.4486875 [mi^2] POLYGON ((561322.1 4187901,...
## 15136   CT 590870 2045459 0.4486875 [mi^2] POLYGON ((561506.7 4189117,...

Let’s also use an overlay plot to check the output geometry.

tm_shape(census_tracts_ac_utm10) + 
  tm_polygons(col='gray', border.col='gray') +
tm_shape(cpad_in_ac) + 
  tm_polygons(col = 'ACRES', palette = 'YlGn',
              border.col = 'black', lwd = 0.4, 
              alpha = 0.8,
              title =  'Protected areas in Alameda County, colored by area')

st_intersects or st_intersection?

It really depends! But make sure you understand the difference.

st_intersects is a logical operator that returns True if two geometries intersect in any way. When used to subset (or filter) a spatial dataframe, st_intersects returns those features in the dataframe that intersect with the filter dataframe.

On the other hand, st_intersection returns a new spatial dataframe that set intersection of the two dataframes, including both the geometries and attributes of the intersecting features. Use st_intersection with caution and always check your results.

Exercise: Spatial Relationship Query

It’s your turn.

Write a spatial relationship query to create a new dataset containing only the BART stations in Berkeley.

Run the next two cells to (1) load the dataset containing Berkeley BART stations and then reproject it to the same CRS as that used by the Berkeley_utm10 dataframe (EPSG: 26910)’ (2) plot these two datasets in an overlay map.

Then, write your own code to: 1. Spatially select the BART stations that are within Berkeley 2. Plot the Berkeley boundary and then overlay the selected BART stations.

# load the Berkeley boundary
bart_df = st_read("notebook_data/transportation/bart.csv")
## Reading layer `bart' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/transportation/bart.csv' using driver `CSV'
## Warning: no simple feature geometries present: returning a data.frame or tbl_df
bart_sf = st_as_sf(bart_df, 
                   coords = c('lon','lat'),
                   crs = 4326)
  
# transform to EPSG:26910
bart_utm10 = st_transform(bart_sf, st_crs(berkeley_utm10))

# display
head(berkeley_utm10)
## Simple feature collection with 1 feature and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 559347.6 ymin: 4188961 xmax: 567371.4 ymax: 4195617
## Projected CRS: NAD83 / UTM zone 10N
##      STATEFP PLACEFP  PLACENS         AFFGEOID   GEOID     NAME LSAD    ALAND
## 1213      06   06000 02409837 1600000US0606000 0606000 Berkeley   25 27127391
##        AWATER                       geometry
## 1213 18715614 MULTIPOLYGON (((559347.6 41...

Plot the data together

plot(bart_utm10$geometry)
plot(berkeley_utm10$geometry, border='blue', add=T)

Your code here!

Now, in the cell below, write the code to spatially select the Berkeley BART stations, then make the map.

# YOUR CODE HERE:

# Spatially select the BART stations within Berkeley

# Plot the Bart stations in Berkeley overlaid on top of the Berkeley City Boundary

Solution hidden here!

To see the solution, inspect the text hidden below (or look in the 06_Spatial_Queries.Rmd file,line 519.)


6.3 Proximity Analysis

Now that we’ve seen the basic idea of spatial measurement and spatial relationship queries, let’s take a look at a common analysis that combines those concepts: promximity analysis.

Proximity analysis seeks to identify near features - or, in other words, all features in a focal feature set that are within some maximum distance of features in a reference feature set.

A very common workflow for this analysis is:

  1. Buffer around the features in the reference dataset to create buffer polygons.

  2. Run a spatial relationship query to find all focal features that intersect (or are within) the buffer polygons.


Let’s read in our bike boulevard data again.

Then we’ll find out which of our Berkeley schools are within a block’s distance (200 meters) of the bike boulevards.

bike_blvds = st_read('notebook_data/transportation/BerkeleyBikeBlvds.geojson')
## Reading layer `BerkeleyBikeBlvds' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/transportation/BerkeleyBikeBlvds.geojson' using driver `GeoJSON'
## Simple feature collection with 211 features and 10 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: 561541.2 ymin: 4189007 xmax: 566451.6 ymax: 4193483
## Projected CRS: WGS 84 / UTM zone 10N
plot(bike_blvds$geometry)

Of course, we need to reproject the boulevards to our projected CRS.

bike_blvds_utm10 = st_transform(bike_blvds, st_crs(berkeley_utm10))

Now we can create our 200 meter bike boulevard buffers.

bike_blvds_buf = st_buffer(bike_blvds_utm10, dist=200)

Now let’s overlay everything.

tm_shape(berkeley_utm10) + 
  tm_polygons(col = 'lightgrey') + 
tm_shape(bike_blvds_buf) + 
  tm_polygons(col = 'pink', alpha = 0.5) +
tm_shape(bike_blvds_utm10) + 
  tm_lines() + 
tm_shape(berkeley_schools_spatial) + 
  tm_dots(col = 'purple', size=0.2)

Great! Looks like we’re all ready to run our spatial relationship query to complete the proximity analysis. At this point (pun intended) select the schools that are in within the bike boulevard buffer polygons.

schools_near_blvds = berkeley_schools_spatial[bike_blvds_buf,]

# or
#schools_near_blvds = berkeley_schools_spatial[bike_blvds_buf, , op='st_within']

Now let’s overlay again, to see if the schools we selected make sense.

tm_shape(berkeley_utm10) + 
  tm_polygons(col = 'lightgrey') + 
  
# add the bike blvd buffer polygons  
tm_shape(bike_blvds_buf) + 
  tm_polygons(col = 'pink', alpha = 0.5) +

# Add the bike blvd line features  
tm_shape(bike_blvds_utm10) + 
  tm_lines() + 

# Add all berkeley schools  
tm_shape(berkeley_schools_spatial) + 
  tm_dots(col = 'purple', size=0.2) +

# Add schools near bike blvds in yellow
tm_shape(schools_near_blvds) + 
  tm_dots(col = 'yellow', size=0.2)

Leveling up!

You can use st_distance and its companion function st_nearest_feature to compute the distance between each feature and the nearest bike boulevard. The st_nearest_feature function returns the ID of the closest feature.

# Identify the nearest bike boulevard for each school
nearest = st_nearest_feature(berkeley_schools_spatial,bike_blvds_utm10)

# take a look!
nearest
##  [1]  71 171  39 136  42  30 163 172 127 171 189 156 137  33 187 197  50 207 169
## [20] 102 125 137   3 109 207  76 135 173  56 102 146  76

Then we can calculate the distance between each school and it’s nearest bike boulevard.

st_distance(berkeley_schools_spatial, bike_blvds_utm10[nearest,], by_element=TRUE)
## Units: [m]
##  [1]  13.848161 985.459488 309.889446 369.402946 196.011379  15.332395
##  [7]  27.250406 439.758905 107.902846 926.216792 193.030072 181.836256
## [13] 373.736477 215.903128 184.321307 186.446907   1.288383  94.064527
## [19] 211.477566 218.613628 186.913116 230.212129  15.162313 188.829602
## [25] 232.764113 224.700672 173.920971  15.892361 514.765384  92.556921
## [31]  93.426741 128.131187

Exercise: Proximity Analysis

Now it’s your turn to try out a proximity analysis!

Write your own code to find all BERKELEY schools within walking distance (1 km) of a BART station.

As a reminder, let’s break this into steps:

  1. Spatially select the BART stations in Berkeley from the bart_utm10 dataframe
  2. Buffer your Berkeley BART stations to 1 km (HINT: remember your units!)
  3. Spatially select the schools within walking distance to the BERKELEY Bart stations.
  4. As always, plot your results for a good visual check!

To see the solution, look at the hidden text below.

Your code here

# YOUR CODE HERE:

# Spatially select the Berkeley Bart Stations
# You may have done this in a previous exercise.

# buffer the BART stations to 1 km

# spatially select the schools within the buffers

# Map your results with tmap
# plot the Berkeley boundary (for reference) in lightgrey

# add the BART stations (for reference) to the plot in green

# add the BART buffers (for check) in lightgreen

# add all Berkeley schools (for reference) in black

# add the schools near BART (for check) in yellow

Solution hidden here!

Bonus Exercise

Compute the distance between each Berkeley School and its nearest BART station!

#YOUR CODE HERE

6.4 Recap

Leveraging what we’ve learned in our earlier lessons, we got to work with map overlays and start answering questions related to proximity. Key concepts include:

  • Measuring area and length
    • st_area,
    • st_length
    • st_distance
  • Spatial Relationship Queries
    • st_intersects,
    • st_intersection
    • st_within, etc.
  • Buffer analysis
    • st_buffer

 D-Lab @ University of California - Berkeley
 Team Geo